
    VARD2: a tool for dealing with spelling variation in historical corpora

    When applying corpus linguistic techniques to historical corpora, the researcher should be cautious about the results obtained. Corpus annotation techniques such as part-of-speech tagging, trained on modern languages, are particularly vulnerable to inaccuracy caused by vocabulary and grammatical shifts in language over time. Basic corpus retrieval techniques such as frequency profiling and concordancing are also affected, as are more sophisticated techniques such as keywords, n-grams, clusters and lexical bundles, which rely on word frequencies for their calculations. In this paper, we highlight these problems with particular focus on Early Modern English corpora. We also present an overview of the VARD tool, our proposed solution, which facilitates pre-processing of historical corpus data by inserting modern equivalents alongside historical spelling variants. Recent improvements to VARD include the incorporation of techniques used in modern spell-checking software.
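    As a rough illustration of the spell-checker-style matching described above, the sketch below normalises historical variants against a modern wordlist. The variant dictionary, lexicon, and similarity threshold are invented for the example; this is not VARD's actual implementation or API.

```python
# Minimal sketch of spell-checker-style normalisation of historical
# spellings (illustrative only; not VARD's actual resources or code).
from difflib import SequenceMatcher

# Hypothetical resources: a hand-built variant list and a modern lexicon.
KNOWN_VARIANTS = {"loue": "love", "vpon": "upon", "haue": "have"}
MODERN_LEXICON = ["love", "upon", "have", "heaven", "hath"]

def normalise(token: str, threshold: float = 0.8) -> str:
    """Return a modern equivalent for a historical spelling variant."""
    # 1. Exact lookup in the variant dictionary.
    if token in KNOWN_VARIANTS:
        return KNOWN_VARIANTS[token]
    # 2. Fall back to the closest modern word by string similarity,
    #    a stand-in for the edit-distance heuristics of spell checkers.
    best = max(MODERN_LEXICON,
               key=lambda w: SequenceMatcher(None, token, w).ratio())
    if SequenceMatcher(None, token, best).ratio() >= threshold:
        return best
    return token  # leave unrecognised tokens untouched

print(normalise("loue"))    # -> love (dictionary hit)
print(normalise("heauen"))  # -> heaven (similarity fallback)
```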

    Corpus analysis of key words

    Computational tools and methods for corpus compilation and analysis

    A computer-assisted approach to the analysis of metaphor variation across genres.

    COVID-19 and Arabic Twitter: How can Arab World Governments and Public Health Organizations Learn from Social Media?

    In March 2020, the World Health Organization declared the COVID-19 outbreak a pandemic. Most previous social media research on COVID-19 has focussed on English tweets. In this study, we collect approximately 1 million Arabic tweets related to COVID-19 from the Twitter streaming API. Focussing on outcomes that we believe will be useful for public health organizations, we analyse them in three ways: identifying the topics discussed during the period, detecting rumours, and predicting the source of the tweets. For the first goal we use the k-means algorithm with k=5; the topics discussed can be grouped as follows: COVID-19 statistics, prayers to God, COVID-19 locations, advice and education for prevention, and advertising. For the second, we sample 2,000 tweets and label them manually as false information, correct information, or unrelated. We then apply three machine learning algorithms, Logistic Regression, Support Vector Classification, and Naïve Bayes, with two feature sets: a word-frequency approach and word embeddings. We find that the machine learning classifiers correctly identify rumour-related tweets with 84% accuracy. We also predict the source of the rumour-related tweets using our previous model, which classifies tweets into five categories: academic, media, government, health professional, and public. Around 60% of the rumour-related tweets are classified as written by health professionals and academics.
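    The rumour-detection step pairs word-frequency features with standard classifiers. The sketch below shows a minimal version of that setup with scikit-learn, using a few invented English stand-ins in place of the authors' labelled Arabic tweets.

```python
# Illustrative sketch of a word-frequency + classifier rumour detector
# (TF-IDF into Logistic Regression); toy data, not the paper's dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tweets = [
    "garlic water cures the virus overnight",    # rumour
    "ministry reports 120 new confirmed cases",  # correct information
    "match postponed until further notice",      # unrelated
    "drinking hot water kills the virus",        # rumour
]
labels = ["false", "correct", "unrelated", "false"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tweets, labels)
print(model.predict(["hot water cures infection"]))  # likely 'false'
```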

    Developing an Arabic Infectious Disease Ontology to Include Non-Standard Terminology

    Building ontologies is a crucial part of the semantic web endeavour. In recent years, research interest in supporting languages such as Arabic in NLP has grown rapidly, but there has been very little work on medical ontologies for Arabic. We present a new Arabic ontology in the infectious disease domain to support important applications, including the monitoring of infectious disease spread via social media. The ontology meaningfully integrates the scientific vocabulary of infectious diseases with its informal equivalents. We use ontology learning strategies with manual checking to build the ontology, applying three statistical methods for term extraction from selected Arabic infectious disease articles: TF-IDF, C-value, and YAKE. We also conducted a study, consulting around 100 individuals, to discover the informal terms related to infectious diseases in Arabic. In future work we will extract the relations between infectious disease concepts automatically; for now these are created manually. We report two complementary evaluations of the ontology: a quantitative evaluation of the term extraction results and a qualitative evaluation by a domain expert.
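    Of the three term-extraction methods, C-value has a compact closed form: it weights a candidate term's frequency by its length and discounts occurrences inside longer candidate terms. The sketch below implements that measure on invented counts; real use would add word-boundary-aware nesting checks and the usual adjustment for single-word terms.

```python
# Minimal sketch of the C-value term-extraction measure on toy counts;
# the frequencies are invented, not drawn from the authors' corpus.
from math import log2

# candidate multi-word term -> corpus frequency (hypothetical)
freq = {
    "infectious disease": 40,
    "infectious disease outbreak": 12,
    "disease outbreak": 18,
}

def c_value(term: str) -> float:
    """C-value(a) = log2|a| * (f(a) - mean freq of longer terms containing a)."""
    words = term.split()
    # Substring containment is a simplification of proper nesting detection.
    nested_in = [t for t in freq if term in t and t != term]
    penalty = sum(freq[t] for t in nested_in) / len(nested_in) if nested_in else 0.0
    return log2(len(words)) * (freq[term] - penalty)

for t in sorted(freq, key=c_value, reverse=True):
    print(f"{t}: {c_value(t):.1f}")
```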

    Lancaster A at SemEval-2017 Task 5: Evaluation metrics matter: predicting sentiment from financial news headlines

    This paper describes our participation in Task 5, Track 2 of SemEval 2017: predicting the sentiment of financial news headlines for a specific company on a continuous scale between -1 and 1. We tackled the problem using a number of approaches, including a Support Vector Regression (SVR) and a Bidirectional Long Short-Term Memory (BLSTM). We found an improvement of 4-6% using the BLSTM model over the SVR, and came fourth in the track. We report a number of different evaluations using a finance-specific word embedding model and reflect on the effects of using different evaluation metrics.
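    As a minimal sketch of the SVR baseline (the BLSTM that outperformed it is not shown), the following fits a TF-IDF + SVR pipeline to a handful of invented headlines with gold scores on the [-1, 1] scale; none of the data comes from the SemEval task.

```python
# Sketch of an SVR baseline: regressing a continuous sentiment score
# in [-1, 1] from headline text (toy headlines, not the SemEval data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline

headlines = [
    "Acme shares surge after record quarterly profit",
    "Acme beats earnings expectations",
    "Acme faces regulatory probe, stock slides",
    "Acme cuts guidance amid weak demand",
]
scores = [0.8, 0.6, -0.7, -0.5]  # gold sentiment on a [-1, 1] scale

model = make_pipeline(TfidfVectorizer(), SVR(kernel="linear"))
model.fit(headlines, scores)
print(model.predict(["Acme posts record profit"]))  # expect a positive score
```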

    OSMAN: a novel Arabic readability metric

    We present OSMAN (Open Source Metric for Measuring Arabic Narratives), a novel open-source Arabic readability metric and tool. It allows researchers to calculate the readability of Arabic text with and without diacritics. OSMAN is a modified version of conventional readability formulas such as Flesch and Fog. We introduce a novel approach to counting short, long and stress syllables in Arabic, which is essential for judging the readability of Arabic narratives. We also introduce an additional factor called “Faseeh” which considers aspects of script usually dropped in informal Arabic writing. To evaluate our methods, we used Spearman's correlation to compare text readability for 73,000 parallel sentences from English and Arabic UN documents. The Arabic sentences were written without diacritics, so to count syllables we added the diacritics back in using an open-source tool called Mishkal. The results show that the OSMAN readability formula correlates well with the English ones, making it a useful tool for researchers and educators working with Arabic text.
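    OSMAN's own coefficients and its Faseeh factor are defined in the paper; as a shape-only illustration of the Flesch-style formula family it modifies, the sketch below uses Flesch's English coefficients, so the numbers it prints are not OSMAN scores.

```python
# Flesch-style skeleton of the formula family OSMAN modifies: the score
# falls as words per sentence and syllables per word rise. Coefficients
# are Flesch's English ones, shown only to illustrate the shape;
# OSMAN's coefficients, syllable counts, and Faseeh factor differ.
def flesch_style_readability(n_sentences: int, n_words: int,
                             n_syllables: int) -> float:
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# e.g. a 2-sentence, 24-word passage with 40 counted syllables
print(flesch_style_readability(2, 24, 40))
```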

    A systematic survey of online data mining technology intended for law enforcement

    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspection becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists that examines their techniques, applications and rigour. This article remedies that gap through a systematic mapping study of online data-mining literature that visibly targets law enforcement applications, using evidence-based survey practices to produce a replicable analysis which can be methodologically examined for deficiencies.